What is the purpose of this project? Why should we care about it? What makes this a difficult exercise? What is your overall modeling strategy? Briefly summarize your results.
The Zestimate, Zillow's algorithm for predicting home values, has increasingly become the predominant reference for buying and selling houses in the real estate market, and is often mistaken for an appraisal tool. Consequently, inaccuracies in its predictions can, now more than ever, critically affect homeowners' personal equity and undermine trust in the tool.
Our goal was to develop a Machine Learning pipeline for predicting house prices that is enhanced with additional layers of data, including local features specific to Boulder County, Colorado, and that is tested and adjusted for spatial and socio-economic biases to ensure its generalizability and, with it, its predictive power.
However, this task presents two important challenges. First, we do not take into account any previous sales or tax assessment data that could better inform our model. Second, we conduct this analysis in Boulder County, Colorado, a county that is not itself a coherent urban unit but is instead composed of Denver's outer suburbs and medium-sized towns, with almost half of its area covered in forest. Furthermore, Boulder is a very homogeneous and atypical county, with a 91% white population, a median income of $81,390 (compared to $62,843 nationwide) and 61% of people 25 or older with higher education (compared to 31% nationwide), per 2019 estimates.
We implemented a supervised Machine Learning model, a linear regression, for predicting house prices (explained in more detail in Part 3). We built a pipeline into which we can feed data (previously cleaned and wrangled), evaluate it, keep or discard it depending on its usefulness, process it, and test it on home samples set aside for prediction, while getting metrics on the current effectiveness of the model. Each version of the model is saved separately and its effectiveness logged, so it can be plugged in or out as needed. This process can then be repeated as many times as needed until the best possible model is obtained.
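The versioning loop described above can be sketched in R as follows. This is only an illustration: the `models/` path, the `logModelVersion()` helper, and the `modelLog` data frame are hypothetical names, not the exact objects in our scripts.

```r
# hypothetical sketch of the model-versioning loop:
# evaluate a candidate model on held-out homes, log its error, save the fit
logModelVersion <- function(model, version, testSet, modelLog) {
  predicted <- exp(predict(model, testSet))    # back-transform from logPrice
  mae <- mean(abs(predicted - testSet$price))  # mean absolute error in dollars
  saveRDS(model, paste0("models/model_v", version, ".rds"))
  rbind(modelLog, data.frame(version = version, MAE = mae))
}
```

Each call appends one row to the running log, so competing versions can be compared side by side before deciding which variables stay in the model.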
Briefly describe your methods for gathering the data.
We structured the data gathering process into five groups, according to their origin and function:
Given the richness of the data in groups A and C, the wrangling process focused on selecting which fields are relevant for predicting house prices and which new variables could be engineered from the raw data; once selected, the results were ready to use. Groups B and D, by contrast, were conditioned by what data was available and required a more intensive understanding of the data, including setting parameters for how to measure its possible effects on home prices. Data in group E was treated much like that of B and D, but with looser expectations of what its effects could be.
# --- A. HOME VALUE DATA ---
# read in home value data
data <- st_read("studentData.geojson") %>%
st_set_crs("ESRI:102254") %>%
st_transform(boulderCRS) %>%
# TODO: add this filter where it's relevant
filter(toPredict == 0)
# recode missing data and engineered features
homeRecodes <- data %>%
mutate(
# change year to factor from float
year = as.factor(year),
# calculate log of price to normalize positive skew
logPrice = log(price),
# recode missing construction material values
constMat = case_when(
ConstCode == 0 ~ "Missing",
ConstCode == 300 ~ "Unspecified",
ConstCode > 300 ~ as.character(ConstCodeDscr)
),
# recode basement as dummy
hasBasement = if_else(bsmtType == 0, 0, 1),
# recode car storage as garage dummy
hasGarage = if_else(str_detect(carStorageTypeDscr, "GARAGE"), 1, 0),
# recode a/c as dummy, excluding fans, evaporative coolers, and unspecified
hasAC = replace_na(if_else(Ac == 210, 1, 0), 0),
# recode missing heating values
heatingType = case_when(
is.na(Heating) ~ "None",
Heating == 800 ~ "Unspecified",
Heating > 800 ~ as.character(HeatingDscr)
),
# recode missing primary exterior wall values
extWall = if_else(
is.na(ExtWallPrim) | ExtWallPrim == 0, "Missing",
as.character(ExtWallDscrPrim)
),
# recode missing secondary exterior wall values
extWall2 = if_else(is.na(ExtWallSec), "None", as.character(ExtWallDscrSec)),
# recode missing interior wall values
intWall = if_else(is.na(IntWall), "Missing", as.character(IntWallDscr)),
# recode missing roof cover values and combine those with few observations
roofType = case_when(
is.na(Roof_Cover) ~ "Missing",
Roof_Cover %in% c(1220, 1240, 1250, 1260, 1290) ~ "Other",
TRUE ~ as.character(Roof_CoverDscr)
),
# recode quality as numeric variable
qualityNum = case_when(
qualityCode == 10 ~ 1, # QualityCodeDscr == "LOW "
qualityCode == 20 ~ 2, # "FAIR "
qualityCode == 30 ~ 3, # "AVERAGE "
qualityCode == 31 ~ 4, # "AVERAGE + "
qualityCode == 32 ~ 5, # "AVERAGE ++ "
qualityCode == 40 ~ 6, # "GOOD "
qualityCode == 41 ~ 7, # "GOOD + "
qualityCode == 42 ~ 8, # "GOOD ++ "
qualityCode == 50 ~ 9, # "VERY GOOD "
qualityCode == 51 ~ 10, # "VERY GOOD + "
qualityCode == 52 ~ 11, # "VERY GOOD ++ "
qualityCode == 60 ~ 12, # "EXCELLENT "
qualityCode == 61 ~ 13, # "EXCELLENT + "
qualityCode == 62 ~ 14, # "EXCELLENT++ "
qualityCode == 70 ~ 15, # "EXCEPTIONAL 1 "
qualityCode == 80 ~ 16  # "EXCEPTIONAL 2 "
),
# recode builtYear as builtEra
builtEra = case_when(
builtYear < 1910 ~ "Pre-1910",
between(builtYear, 1910, 1919) ~ "1910s",
between(builtYear, 1920, 1929) ~ "1920s",
between(builtYear, 1930, 1939) ~ "1930s",
between(builtYear, 1940, 1949) ~ "1940s",
between(builtYear, 1950, 1959) ~ "1950s",
between(builtYear, 1960, 1969) ~ "1960s",
between(builtYear, 1970, 1979) ~ "1970s",
between(builtYear, 1980, 1989) ~ "1980s",
between(builtYear, 1990, 1999) ~ "1990s",
between(builtYear, 2000, 2009) ~ "2000s",
between(builtYear, 2010, 2019) ~ "2010s",
builtYear >= 2020 ~ "2020s"
),
# recode section_num as manySections
manySections = if_else(section_num > 1, 1, 0),
# recode design to remove all caps for table
designType = if_else(
designCode == "0120", "Multi-Story Townhouse", as.character(designCodeDscr)
)
)
# create clean data frame for modeling
homeData <- homeRecodes %>%
# drop extreme outliers identified as data entry errors
filter(!MUSA_ID %in% c(8735,1397,5258)) %>%
# drop unneeded columns
dplyr::select(
# same for all
-bldgClass,
-bldgClassDscr,
-status_cd,
# not needed
-saleDate,
-address,
-bld_num,
# redundant
-year,
# too much missing data
-Stories,
-UnitCount,
# cleaned
-designCode,
-qualityCode,
-ConstCode,
-ConstCodeDscr,
-bsmtType,
-bsmtTypeDscr,
-carStorageType,
-carStorageTypeDscr,
-Ac,
-AcDscr,
-Heating,
-HeatingDscr,
-ExtWallPrim,
-ExtWallDscrPrim,
-ExtWallSec,
-ExtWallDscrSec,
-IntWall,
-IntWallDscr,
-Roof_Cover,
-Roof_CoverDscr,
# recoded
-section_num,
-qualityCodeDscr,
-builtYear,
-designCodeDscr
)
# isolate home IDs to use in spatial joins
homeIDs <- data %>%
dplyr::select(MUSA_ID, geometry)
# --- B. BOUNDARY DATA ---
# import county limits for later reference
countyLimits <- st_read('County_Boundary.geojson') %>%
select(OBJECTID, geometry) %>%
st_transform(boulderCRS)
# import municipality limits for later reference
munis <- st_read('Municipalities.geojson') %>%
select(ZONEDESC, geometry) %>%
st_transform(boulderCRS)
# B1. "Neighborhoods" created from open data
# TODO: add other neighborhood code
# B2. Census tracts
# import Census tract boundaries as proxy for neighborhoods,
# plus tract median income for generalizability test
tracts <-
get_acs(geography = "tract",
variables = "B19013_001E", # median household income
year = year,
state = state,
county = county,
geometry = T,
output = "wide") %>%
dplyr::select(GEOID, B19013_001E, geometry)%>%
rename(tract = GEOID,
medianIncome = B19013_001E) %>%
st_transform(boulderCRS)
# attach tract IDs to each home via spatial join
tractsData <- st_join(homeIDs, tracts) %>%
dplyr::select(-medianIncome) %>%
st_drop_geometry()
# --- C. AMERICAN COMMUNITY SURVEY DATA ---
# review available variables
# acsVariableList <- load_variables(year,"acs5", cache = TRUE)
# define variables to import
# TODO: probably get rid of <$100k income group; highly colinear
# TODO: maybe also get rid of occupancy, vacancy?
acsVars <- c("B02001_001E", # race: total
"B02001_002E", # race: white alone
'B25003_001E', # tenure: occupied housing units
'B25003_002E', # tenure: owner-occupied
'B25002_001E', # occupancy: total housing units
'B25002_003E', # occupancy: vacant housing units
'B15003_001E', # educational attainment: total
'B15003_022E', # educational attainment: bachelor's degree
'B15003_023E', # educational attainment: master's degree
'B15003_024E', # educational attainment: professional degree
'B15003_025E', # educational attainment: doctorate degree
'B19001_001E', # household income: total
'B19001_002E', # household income: less than $10k
'B19001_003E', # household income: $10-15k
'B19001_004E', # household income: $15-20k
'B19001_005E', # household income: $20-25k
'B19001_006E', # household income: $25-30k
'B19001_007E', # household income: $30-35k
'B19001_008E', # household income: $35-40k
'B19001_009E', # household income: $40-45k
'B19001_010E', # household income: $45-50k
'B19001_011E', # household income: $50-60k
'B19001_012E', # household income: $60-75k
'B19001_013E', # household income: $75-100k
'B19001_014E', # household income: $100-125k
'B19001_015E', # household income: $125-150k
'B19001_016E', # household income: $150-200k
'B19001_017E') # household income: $200 or more
# import variables from ACS 2019 5-year
blockGroups <-
get_acs(geography = "block group",
variables = acsVars,
year = year,
state = state,
county = county,
geometry = T,
output = 'wide') %>%
dplyr::select(-ends_with('M')) %>%
rename(# white population
raceTotal = B02001_001E, # race: total
whiteAlone = B02001_002E, # race: white alone
# vacant housing units
totalUnits = B25002_001E, # occupancy status: total
vacantUnits = B25002_003E, # occupancy status: vacant
# homeowners
occupiedUnits = B25003_001E, # tenure: total
ownerOccupied = B25003_002E, # tenure: owner-occupied
# highest educational attainment
eduTotal = B15003_001E, # educational attainment: total
eduBachs = B15003_022E, # educational attainment: bachelor's degree
eduMasts = B15003_023E, # educational attainment: master's degree
eduProfs = B15003_024E, # educational attainment: professional degree
eduDocts = B15003_025E, # educational attainment: doctorate degree
# household income
incomeTotal = B19001_001E, # household income: total
income000 = B19001_002E, # household income: less than $10k
income010 = B19001_003E, # household income: $10-15k
income015 = B19001_004E, # household income: $15-20k
income020 = B19001_005E, # household income: $20-25k
income025 = B19001_006E, # household income: $25-30k
income030 = B19001_007E, # household income: $30-35k
income035 = B19001_008E, # household income: $35-40k
income040 = B19001_009E, # household income: $40-45k
income045 = B19001_010E, # household income: $45-50k
income050 = B19001_011E, # household income: $50-60k
income060 = B19001_012E, # household income: $60-75k
income075 = B19001_013E, # household income: $75-100k
income100 = B19001_014E, # household income: $100-125k
income125 = B19001_015E, # household income: $125-150k
income150 = B19001_016E, # household income: $150-200k
income200 = B19001_017E # household income: $200k or more
)%>%
mutate(pctWhite = whiteAlone/raceTotal,
pctVacant = vacantUnits/totalUnits,
pctOwnerOccupied = ownerOccupied/occupiedUnits,
# calculate percent with bachelor's or higher
# TODO: compare percent postgraduate?
pctHigherEdu = if_else(
eduTotal > 0, (eduBachs + eduMasts + eduProfs + eduDocts)/eduTotal, 0
),
# calculate percent in each income category
pctIncome000 = income000/incomeTotal,
pctIncome010 = income010/incomeTotal,
pctIncome015 = income015/incomeTotal,
pctIncome020 = income020/incomeTotal,
pctIncome025 = income025/incomeTotal,
pctIncome030 = income030/incomeTotal,
pctIncome035 = income035/incomeTotal,
pctIncome040 = income040/incomeTotal,
pctIncome045 = income045/incomeTotal,
pctIncome050 = income050/incomeTotal,
pctIncome060 = income060/incomeTotal,
pctIncome075 = income075/incomeTotal,
pctIncome100 = income100/incomeTotal,
pctIncome125 = income125/incomeTotal,
pctIncome150 = income150/incomeTotal,
pctIncome200 = income200/incomeTotal,
# recode final income features after exploratory analysis
pctIncomeBelow100k = (
income000 + income010 + income015 + income020 + income025 +
income030 + income035 + income040 + income045 + income050 +
income060 + income075
)/incomeTotal,
pctIncomeAbove200k = pctIncome200
) %>%
select(GEOID, pctWhite, pctVacant, pctOwnerOccupied, pctHigherEdu,
pctIncomeBelow100k, pctIncomeAbove200k, geometry) %>%
rename(blockGroup = GEOID) %>%
st_transform(boulderCRS)
blockGroupBoundaries <- blockGroups %>%
select(blockGroup, geometry)
censusData <- st_join(homeIDs, blockGroupBoundaries) %>%
st_drop_geometry() %>%
left_join(., blockGroups, by="blockGroup") %>%
dplyr::select(-blockGroup, -geometry)
# --- E. EXPERIMENTAL OPEN DATA ---
# E1. Distance to recreational cannabis dispensaries
dispensaries <- st_read("Marijuana_Establishments.geojson") %>%
filter(str_detect(Description, "Store")) %>%
dplyr::select(OBJECTID, Type, geometry) %>%
st_sf() %>%
st_transform(boulderCRS)
dispensaryData <- homeIDs %>%
mutate(dispensaryDistance = nn_function(st_coordinates(.), st_coordinates(dispensaries), 1)) %>%
st_drop_geometry()
# E2. Distance to Whole Foods Market stores
# read in raw address data; read.csv keeps lon/lat numeric for st_as_sf
wholeFoodsRaw <- read.csv("wholefoodsmarkets_boulderCO.csv")
# transform to sf object
wholeFoods <- st_as_sf(wholeFoodsRaw, coords = c("lon", "lat"), crs = 4326) %>%
dplyr::select(ID, geometry) %>%
st_sf() %>%
st_transform(boulderCRS)
# calculate nearest neighbor distance
wholeFoodsData <- homeIDs %>%
mutate(wholeFoodsDistance = nn_function(st_coordinates(.), st_coordinates(wholeFoods), 1)) %>%
st_drop_geometry()
# E3. Distance to highways
# read in state highway data
coloradoHighways <- st_read('HighwaysByFunctionalClass.geojson') %>%
dplyr::select(OBJECTID, geometry) %>%
st_transform(boulderCRS) %>%
# crop to roads in or near Boulder County
st_crop(.,st_buffer(countyLimits, 8045)) %>%
# merge features to avoid distance matrix
st_union(.)
# calculate distance from highways and save;
# as.numeric() drops units and flattens the n-by-1 distance matrix
highwayData <- homeIDs %>%
mutate(highwayDistance = as.numeric(st_distance(., coloradoHighways))) %>%
st_drop_geometry()
# --- COMBINE ALL ---
finalData <- left_join(homeData, censusData) %>%
left_join(., wildfireData) %>%
left_join(., floodData) %>%
left_join(., dispensaryData) %>%
left_join(., wholeFoodsData) %>%
left_join(., highwayData) %>%
left_join(., tractsData)
Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this.
# TODO: label variables
# TODO: pick one of stargazer and sumtable
# remove non-variable columns
summaryStatsData <- finalData %>%
dplyr::select(-toPredict, -MUSA_ID) %>%
st_drop_geometry()
# create summary statistics data frame
# summaryStats <- sumtable(as.data.frame(finalData), add.median = T, out = "return")
# create polished summary statistics table
# summaryStats %>%
# kbl(
# caption = "Summary Statistics",
# digits = 2,
# format.args = list(big.mark = ","),
# knitr.kable.NA = "*"
# ) %>%
# kable_classic()
stargazer(summaryStatsData, type = "html")
| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| price | 11,261 | 745,787.000 | 541,949.300 | 10,000 | 452,700 | 831,000 | 7,350,000 |
| CompCode | 11,261 | 0.988 | 0.096 | 0 | 1 | 1 | 1 |
| EffectiveYear | 11,261 | 1,997.708 | 17.457 | 1,879 | 1,989 | 2,010 | 2,021 |
| bsmtSF | 11,261 | 744.272 | 627.702 | 0 | 137 | 1,126 | 4,534 |
| carStorageSF | 11,261 | 471.599 | 235.777 | 0 | 380 | 600 | 3,040 |
| nbrBedRoom | 11,261 | 3.445 | 1.051 | 0 | 3 | 4 | 24 |
| nbrRoomsNobath | 11,261 | 7.390 | 2.135 | 0 | 6 | 9 | 28 |
| mainfloorSF | 11,261 | 1,313.268 | 607.493 | 0 | 923 | 1,619 | 7,795 |
| nbrThreeQtrBaths | 11,261 | 0.617 | 0.730 | 0 | 0 | 1 | 9 |
| nbrFullBaths | 11,261 | 1.782 | 0.872 | 0 | 1 | 2 | 12 |
| nbrHalfBaths | 11,261 | 0.555 | 0.540 | 0 | 0 | 1 | 3 |
| TotalFinishedSF | 11,261 | 1,954.805 | 891.584 | 0 | 1,278 | 2,443 | 10,188 |
| logPrice | 11,261 | 13.364 | 0.536 | 9.210 | 13.023 | 13.630 | 15.810 |
| hasBasement | 11,261 | 0.771 | 0.420 | 0 | 1 | 1 | 1 |
| hasGarage | 11,261 | 0.897 | 0.304 | 0 | 1 | 1 | 1 |
| hasAC | 11,261 | 0.634 | 0.482 | 0 | 0 | 1 | 1 |
| qualityNum | 11,261 | 5.535 | 2.446 | 1 | 3 | 7 | 16 |
| manySections | 11,261 | 0.007 | 0.081 | 0 | 0 | 0 | 1 |
| pctWhite | 11,261 | 0.899 | 0.054 | 1 | 0.9 | 0.9 | 1 |
| pctVacant | 11,261 | 0.051 | 0.095 | 0.000 | 0.000 | 0.068 | 0.844 |
| pctOwnerOccupied | 11,261 | 0.736 | 0.179 | 0.000 | 0.622 | 0.884 | 1.000 |
| pctHigherEdu | 11,261 | 0.634 | 0.164 | 0.000 | 0.533 | 0.739 | 0.948 |
| pctIncomeBelow100k | 11,261 | 0.501 | 0.164 | 0.112 | 0.385 | 0.617 | 1.000 |
| pctIncomeAbove200k | 11,261 | 0.173 | 0.114 | 0.000 | 0.085 | 0.258 | 0.551 |
| wildfireHistory | 11,261 | 0.487 | 1.324 | 0 | 0 | 0 | 9 |
| floodRisk | 11,261 | 0.040 | 0.220 | 0 | 0 | 0 | 2 |
| dispensaryDistance | 11,261 | 5,465.989 | 3,911.418 | 211.542 | 2,956.374 | 7,068.126 | 29,113.330 |
| wholeFoodsDistance | 11,261 | 5,790.450 | 4,557.740 | 98.396 | 2,739.527 | 7,459.678 | 35,763.420 |
| highwayDistance | 11,261 | 1,205.426 | 970.655 | 12.755 | 477.797 | 1,728.735 | 7,929.964 |
Present a correlation matrix
# select numeric variables
numericVars <- select_if(st_drop_geometry(finalData), is.numeric) %>%
dplyr::select(
# omit for more legible chart
-toPredict,
-MUSA_ID,
-logPrice) %>%
na.omit()
# generate correlation matrix
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
show.diag = T,
colors = c("#25cb10", "#ffffff", "#fa7800"),
type = "lower",
insig = "blank",
hc.order = T
) +
labs(title = "Correlation across numeric variables") +
plotTheme()
Present 4 home price correlation scatterplots that you think are of interest. I’m going to look for interesting open data that you’ve integrated with the home sale observations.
# TODO: select four variables to plot:
# probably pctIncomeAbove200k, pctHigherEdu, wildfireHistory, floodRisk?
correlationVars <- c("pctHigherEdu", "pctIncomeAbove200k", "pctIncomeBelow100k", "wildfireHistory", "floodRisk")
# create scatterplots
st_drop_geometry(finalData) %>%
dplyr::select(price, all_of(correlationVars)) %>%
pivot_longer(cols = !price, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(Value, price)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", color = "#fa7800") +
scale_y_continuous(labels = scales::dollar_format()) +
facet_wrap(~Variable, ncol = 4, scales = "free_x") +
labs(title = "Price as a function of numeric variables") +
plotTheme()
Develop 1 map of your dependent variable (sale price)
# TODO: display these side by side with gridExtra?
# --- 1. sale price by Census tract ---
# calculate mean sale price by tract
priceByTract <- finalData %>%
dplyr::select(tract, price) %>%
st_drop_geometry() %>%
group_by(tract) %>%
summarize(
meanPrice = mean(price),
medianPrice = median(price)
) %>%
left_join(., tracts) %>%
dplyr::select(-medianIncome) %>%
st_sf()
ggplot() +
geom_sf(
data = priceByTract,
aes(fill = meanPrice),
lwd = 0.1,
color = "black",
alpha = 0.7
) +
scale_fill_viridis_c("Mean sale price",
direction = -1,
option = "G") +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of home prices",
subtitle = "Mean sale price by Census tract",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme()
# --- 2. individual home sale prices ---
# map individual home prices
ggplot() +
geom_sf(data = countyLimits, fill = "gray99") +
geom_sf(
data = homeData,
size = 1,
aes(color = price)
) +
geom_sf(data = munis, fill = "transparent", lwd = 0.5) +
scale_color_viridis_c("Sale price",
direction = -1,
option = "D") +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of sale prices",
subtitle = "Individual homes relative to county and municipality boundaries",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme()
# --- combined plot ---
grid.arrange(ncol = 2,
ggplot() +
geom_sf(
data = priceByTract,
aes(fill = meanPrice),
lwd = 0.1,
color = "black",
alpha = 0.7
) +
scale_fill_viridis_c("Mean price",
direction = -1,
option = "G",
labels = scales::dollar_format()) +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of home prices",
subtitle = "Mean sale price by Census tract",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme(),
ggplot() +
geom_sf(data = countyLimits, fill = "gray99") +
geom_sf(
data = homeData,
size = 1,
aes(color = price)
) +
geom_sf(data = munis, fill = "transparent", lwd = 0.5) +
scale_color_viridis_c("Price",
direction = -1,
option = "D",
labels = scales::dollar_format()) +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of sale prices",
subtitle = "Individual homes with municipal boundaries",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme()
)
Develop 3 maps of 3 of your most interesting independent variables.
# map wildfires in last 20 years
ggplot() +
geom_sf(data = countyLimits, fill = "gray20", color = "gray10", size = 0.5) +
geom_sf(data = tracts, fill = "transparent", color = "gray10") +
geom_sf(data = finalData, aes(color = wildfireHistory), alpha = 0.7, size = 1.5) +
scale_color_viridis_c("Fires", option = "B", breaks = c(0, 2, 4, 6, 8)) +
labs(title = "Wildfires within two miles in last 20 years",
subtitle = "Individual homes relative to Census tracts",
caption = "Fig. X.X") + # TODO: number before finalizing
mapTheme()
# TODO: try all distances logged and unlogged; make sure consistent and correctly labeled in final
# map distance to dispensaries
ggplot() +
geom_sf(data = countyLimits, fill = "gray98", color = "gray85", lwd = 0.5) +
geom_sf(data = tracts, fill = "transparent", color = "gray90") +
geom_sf(data = finalData, aes(color = dispensaryDistance)) +
scale_color_viridis_c("Distance \nin meters", option = "D", direction = -1, label = scales::comma) +
geom_sf(data = dispensaries, aes(fill = Type), size = 3, shape = 25) +
scale_fill_manual("", labels = "Dispensary", values = "black") +
labs(title = "Distance to recreational cannabis dispensary",
subtitle = "Individual homes with Census tracts and dispensary locations",
caption = "Fig. X.X") + # TODO: number before finalizing
mapTheme()
# map distance from nearest highway
ggplot() +
geom_sf(data = countyLimits, fill = "gray98", color = "gray95", lwd = 0.5) +
geom_sf(data = st_crop(coloradoHighways, countyLimits), color = "gray90", lwd = 3) +
geom_sf(data = finalData, aes(color = highwayDistance)) +
scale_color_viridis_c("Distance \nin meters", option = "E", label = scales::comma) +
labs(title = "Distance from nearest highway",
subtitle = "Individual homes relative to highway network",
caption = "Fig. X.X") +
mapTheme()
Include any other maps/graphs/charts you think might be of interest.
Briefly describe your method (remember who your audience is).
Developing and refining a Machine Learning model is not a linear process but a cyclical one. Nevertheless, a consistent series of steps is laid out to test each version of the model.
First, we look into the correspondence (or correlation) between the sample home prices and each of the input (or independent) variables. We visualize this as individual scatterplots and as an overall matrix diagram that summarizes correlations among all variables, giving us an initial sense of which data is useful and which is redundant.
Second, we perform a linear regression: a method for estimating how much home prices change with each additional unit of the input variables, informing us not only of their individual effectiveness but also of their aggregate predictive power. This lets us further evaluate which data should make it into our model.
The next step, validation, divides the data into two groups: a 'training' set used to estimate the coefficients relating the variables to known house prices, and a 'test' set on which to predict house prices using those coefficients. By comparing the predicted values with the real ones, we measure the amount of error in our model, that is, how wrong our predictions are or, alternatively, how precisely our model can predict. At this point we also check for outliers with extreme price errors and, using other tools like Google Maps, assess whether they should be excluded from the model.
On top of that, we test our model in various ways to check whether it predicts significantly better in some places than in others, looking for errors clustered in space. The first test, spatial lag, averages the price error of each sample with that of its neighbors. The second, Moran's I, tests whether an underlying spatial structure clusters similar prices together, rather than prices being randomly distributed. Finally, we check for neighborhood effects, comparing the amount of error with and without neighborhoods accounted for.
Finally, we test the generalizability of our model by comparing the distribution of its remaining errors with the income context, namely, a map of the neighborhoods where most people have an income below the county median versus those where most have an income above it.
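Once the test set carries prediction errors (computed later in the analysis), this comparison can be sketched as below. This is only a sketch: it assumes the tract-level `medianIncome` from the Census import has been joined back onto the test set, a join not shown here.

```r
# compare model error across income contexts (sketch; assumes a
# medianIncome column has been joined onto homes.test by Census tract)
countyMedian <- median(tracts$medianIncome, na.rm = TRUE)
homes.test %>%
  st_drop_geometry() %>%
  mutate(incomeContext = if_else(medianIncome > countyMedian,
                                 "Above county median",
                                 "Below county median")) %>%
  group_by(incomeContext) %>%
  summarize(MAE = mean(price.AbsError, na.rm = TRUE),
            MAPE = mean(price.APE, na.rm = TRUE))
```

A generalizable model should show similar MAE and MAPE in both groups; a large gap signals socio-economic bias in the predictions.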
Split the ‘toPredict’ == 0 into a separate training and test set using a 75/25 split.
Provide a polished table of your (training set) lm summary results (coefficients, R2 etc).
Because each of Boulder County’s 68 Census tracts appears as a separate categorical variable in the regression results, we include the full results table as an Appendix at the end of this document.
# select regression data
regData <- finalData %>%
filter(toPredict == 0)
# split data into training (75%) and validation (25%) sets
inTrain <- createDataPartition(
y = paste(
regData$constMat,
regData$heatingType,
regData$extWall,
regData$extWall2,
regData$tract
),
p = 0.75, list = FALSE)
homes.training <- regData[inTrain,]
homes.test <- regData[-inTrain,]
# estimate model on training set
reg.training <- lm(logPrice ~ .,
data = st_drop_geometry(homes.training) %>%
dplyr::select(-toPredict, -MUSA_ID, -price)
)
# view results
# TODO: remove before finalizing
# summary(reg.training)
Provide a polished table of mean absolute error and MAPE for a single test set. Check out the “kable” function for markdown to create nice tables.
# generate predictions and calculate errors
homes.test <- homes.test %>%
mutate(
Regression = "Census tract effect",
logPrice.Predict = predict(reg.training, homes.test),
price.Predict = exp(logPrice.Predict),
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)/price.Predict)
)
# calculate MAE and MAPE
errorTable <- data.frame(
MAE = scales::dollar(mean(homes.test$price.AbsError), accuracy = 0.01),
MAPE = scales::percent(mean(homes.test$price.APE), accuracy = 0.01)
)
# generate polished table
errorTable %>%
kbl(
caption = "Mean absolute error and mean absolute percent error",
digits = 3,
format.args = list(big.mark = ",")
) %>%
kable_classic(full_width = F)
| MAE | MAPE |
|---|---|
| $98,997.57 | 12.55% |
Provide the results of your cross-validation tests. This includes mean and standard deviation MAE. Do 100 folds and plot your cross-validation MAE as a histogram. Is your model generalizable to new data?
# perform k-fold cross-validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
# k-fold model training
reg.cv <-
train(
logPrice ~ .,
data = st_drop_geometry(homes.training) %>%
dplyr::select(-toPredict, -MUSA_ID, -price),
method = "lm",
trControl = fitControl,
na.action = na.omit
)
reg.cv
# get MAE for all 100 folds
allMAE.df <- data.frame(MAE = reg.cv$resample$MAE)
# plot MAE distribution as histogram
ggplot() +
geom_histogram(data = allMAE.df,
aes(x = MAE),
fill = "#fa7800",
color = "#ffffff",
binwidth = 0.002) +
scale_x_continuous(breaks = round(seq(min(allMAE.df$MAE), max(allMAE.df$MAE), by = 0.02), 2)) +
labs(title = "Distribution of mean absolute error",
subtitle = "k-fold cross-validation; k = 100",
caption = "Fig. X.X",
x = "MAE",
y = "Folds") + # TODO: number before finalizing
plotTheme()
# calculate mean and standard deviation MAE
validationTable <- data.frame(Mean = round(mean(allMAE.df$MAE), 3),
StdDev = round(sd(allMAE.df$MAE), 3))
# generate polished table
validationTable %>%
kbl(
caption = "MAE across 100 folds"
) %>%
kable_classic(full_width = F)
Plot predicted prices as a function of observed prices
# TODO: note low accuracy on the high end in writeup; discuss possible reasons
# plot predicted prices as function of observed prices
ggplot(data = homes.test,
aes(x=price,
y=price.Predict)) +
geom_point(color = "#000000") +
geom_abline(color = "#fa7800", size = 1) +
geom_smooth(method = "lm", color = "#25cb10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Predicted sale price as a function of observed price",
subtitle = "Orange line represents a perfect prediction; green line represents average prediction",
x = "Price",
y = "Predicted price") +
plotTheme()
# filter to homes sold for less than $1 million
homes.test.sub1mil <- homes.test %>%
filter(price < 1000000) %>%
mutate(
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)/price.Predict)
)
# filter to homes sold for $2 million or more
homes.test.over1mil <- homes.test %>%
filter(price >= 2000000) %>%
mutate(
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)/price.Predict)
)
grid.arrange(ncol = 2,
ggplot(data = homes.test.sub1mil,
aes(x=price,
y=price.Predict)) +
geom_point(color = "#000000") +
geom_abline(color = "#fa7800", size = 1) +
geom_smooth(method = "lm", color = "#25cb10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Predicted price as function of observed price, \nhomes under $1 million",
subtitle = "Orange line represents a perfect prediction; \ngreen line represents average prediction",
x = "Price",
y = "Predicted price") +
plotTheme(),
ggplot(data = homes.test.over1mil, aes(x=price, y=price.Predict)) +
geom_point(color = "#000000") +
geom_abline(color = "#fa7800", size = 1) +
geom_smooth(method = "lm", color = "#25cb10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Predicted price as function of observed price, \nhomes over $2 million",
subtitle = "Orange line represents a perfect prediction; \ngreen line represents average prediction",
x = "Price",
y = "Predicted price") +
plotTheme()
)
# calculate MAE and MAPE for less expensive homes
errorTable.sub1mil <- data.frame(MAE = scales::dollar(mean(homes.test.sub1mil$price.AbsError), accuracy = 0.01),
MAPE = scales::percent(mean(homes.test.sub1mil$price.APE), accuracy = 0.01))
# calculate MAE and MAPE for very expensive homes
errorTable.over1mil <- data.frame(
MAE = scales::dollar(
mean(homes.test.over1mil$price.AbsError), accuracy = 0.01),
MAPE = scales::percent(mean(homes.test.over1mil$price.APE), accuracy = 0.01))
# generate polished table for less expensive homes
errorTable.sub1mil %>%
kbl(
caption = "Mean absolute error and mean absolute percent error, homes under $1 million",
digits = 3,
format.args = list(big.mark = ",")
) %>%
kable_classic(full_width = F, position = "float_left")
# generate polished table for very expensive homes
errorTable.over1mil %>%
kbl(
caption = "Mean absolute error and mean absolute percent error, homes over $2 million",
digits = 3,
format.args = list(big.mark = ",")
) %>%
kable_classic(full_width = F, position = "right")
Provide a map of your residuals for your test set. Include a Moran’s I test and a plot of the spatial lag in errors.
# --- Residuals ---
# map residuals for test set
ggplot() +
geom_sf(data = countyLimits, fill = "gray99") +
geom_sf(
data = homes.test %>%
arrange(price.AbsError),
size = 2,
aes(color = price.Error,
alpha = price.AbsError)
) +
geom_sf(data = munis, fill = "transparent", size = 0.5) +
scale_color_gradient("Prediction error",
low = "#fa7800",
high = "#25cb10",
label = scales::dollar_format()) +
scale_alpha_continuous(range = c(0.2, 1),
guide = "none") +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of prediction errors",
subtitle = "Individual homes relative to county and municipal boundaries",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme()
# --- Spatial Lag ---
coords.test <- st_coordinates(homes.test)
# Take value of neighbors for weighted matrix
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
# Create spatial weight matrix
spatialWeights.test <- nb2listw(neighborList.test, style="W")
homes.test <- homes.test %>%
mutate(
# TODO: confirm this is right
lagPrice = lag.listw(spatialWeights.test, price),
lagPriceError = lag.listw(spatialWeights.test, price.Error)
)
grid.arrange(ncol = 2,
# plot price by spatial lag of price
ggplot(data = homes.test, aes(x=lagPrice, y=price)) +
geom_point(color = "#fa7800") +
geom_smooth(method = "lm", color = "#25cb10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Price as function of spatial lag of price",
x = "Spatial lag of price \n(Mean price of 5 nearest neighbors)",
y = "Price") +
plotTheme(),
# plot error by spatial lag of error
ggplot(data = homes.test, aes(x=lagPriceError, y=price.Error)) +
geom_point(color = "#fa7800") +
geom_smooth(method = "lm", color = "#25cb10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Error as function of spatial lag of error",
x = "Spatial lag of errors \n(Mean error of 5 nearest neighbors)",
y = "Error") +
plotTheme()
)
# --- Moran's I ---
# Run the actual Moran's I
moranTest <- moran.mc(homes.test$price.Error,
spatialWeights.test, nsim = 999)
# Plot of the Observed Moran's I and the permutations Moran's I distribution
ggplot(data.frame(moranI = moranTest$res[1:999]), aes(x = moranI)) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#fa7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title = "Observed and permuted Moran's I",
subtitle = "Observed Moran's I in orange",
caption = "Fig. X.X",
x="Moran's I",
y="Count") +
plotTheme()
# compare mean observed price with the mean prediction of a tract fixed-effects model (price ~ tract), by tract
left_join(
st_drop_geometry(homes.test) %>%
group_by(tract) %>%
summarize(meanPrice = mean(price, na.rm = T)),
mutate(homes.test, predict.fe =
predict(lm(price ~ tract, data = homes.test),
homes.test)) %>%
st_drop_geometry %>%
group_by(tract) %>%
summarize(meanPrediction = mean(predict.fe))) %>%
# TODO: add number formatting from other tables
kbl(
digits = 3,
format.args = list(big.mark = ",")
) %>% kable_classic()
Provide a map of your predicted values for where ‘toPredict’ is both 0 and 1.
# --- generate predictions ---
# set regression data set to include both toPredict == 0 and toPredict == 1
regData <- finalData
# create new training and test sets
all.homes.training <- filter(regData, toPredict == 0)
all.homes.test <- filter(regData, toPredict == 1)
# estimate model on training set
final.reg.training <- lm(logPrice ~ .,
data = st_drop_geometry(all.homes.training) %>%
dplyr::select(-toPredict, -MUSA_ID, -price),
na.action = na.exclude
)
# predict home prices
finalData.predictions <- finalData %>%
mutate(
logPrice.Predict = predict(final.reg.training, finalData),
price.Predict = exp(logPrice.Predict),
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)/price)
)
# map predicted home prices
ggplot() +
geom_sf(data = countyLimits, fill = "gray99") +
geom_sf(
data = finalData.predictions %>%
arrange(price.Predict),
size = 1.5,
aes(color = price.Predict), alpha = 0.5
) +
geom_sf(data = munis, fill = "transparent", lwd = 0.5) +
scale_color_viridis_c("Predicted price",
direction = -1,
option = "D",
labels = scales::dollar_format()) +
geom_sf(data = countyLimits, fill = "transparent") +
labs(
title = "Spatial distribution of sale prices",
subtitle = "Individual homes relative to municipal boundaries",
caption = "Fig. X.X" # TODO: number figure before finalizing
) +
mapTheme()
Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood.
# 6. NEIGHBORHOOD EFFECTS
# Calculate a baseline regression without the neighborhoods
reg.baseline <- lm(logPrice ~ ., data = as.data.frame(st_drop_geometry(homes.training)) %>%
dplyr::select(-toPredict, -MUSA_ID, -price, -tract))
summary(reg.baseline)
homes.test.baseline <-
homes.test %>%
dplyr::select(-tract) %>%
mutate(Regression = "No spatial predictor",
logPrice.Predict = predict(reg.baseline, homes.test),
price.Predict = exp(logPrice.Predict),
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = abs(price.Predict - price) / price) %>%
left_join(., tractsData)
mean(homes.test.baseline$price.AbsError) # MAE
mean(homes.test.baseline$price.APE) # MAPE
# TODO: rejoin tracts to homes.test.baseline?
bothRegressions <- rbind(
dplyr::select(homes.test, starts_with("price"), Regression, tract) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error)),
dplyr::select(homes.test.baseline, starts_with("price"), Regression, tract) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error))
)
# map MAPE by neighborhood
# TODO: format map to match others
st_drop_geometry(bothRegressions) %>%
group_by(Regression, tract) %>%
summarize(mean.MAPE = mean(price.APE, na.rm = T)) %>%
ungroup()%>%
left_join(tracts) %>%
st_sf() %>%
mutate(Regression = fct_reorder(Regression, mean.MAPE, .desc = T)) %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE), color = "gray30", size = 0.2) +
geom_sf(data = countyLimits, fill = "transparent",
color = "gray25", size = 0.5) +
facet_wrap(~Regression) +
scale_fill_viridis_c("MAPE", option = "G", direction = -1, alpha = 0.5) +
labs(title = "Mean absolute percent error by Census tract",
subtitle = "With and without \"neighborhood\" predictor",
caption = "Fig. X.X") + # TODO: number before finalizing
mapTheme()
# TODO: create this map for other neighborhoods
# Produce a table comparing non-neighborhood effects of a 'Baseline' regression
# TODO: match formatting to other tables
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -tract) %>%
filter(Variable == "price.AbsError" | Variable == "price.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
mutate(
price.AbsError = scales::dollar(price.AbsError, accuracy = 0.01),
price.APE = scales::percent(price.APE, accuracy = 0.01)
) %>%
arrange(desc(price.AbsError)) %>%
kable(
digits = 3,
format.args = list(big.mark = ",")
) %>%
kable_classic(full_width = F)
Provide a scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood.
# plot MAPE by neighborhood as function of mean price by neighborhood
st_drop_geometry(bothRegressions) %>%
group_by(Regression, tract) %>%
summarize(mean.MAPE = mean(price.APE, na.rm = T),
mean.Price = mean(price, na.rm = T)) %>%
ggplot(aes(x = mean.Price, y = mean.MAPE)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, size = 1, colour = "#25CB10") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~Regression) +
labs(title = "Tract-level MAPE as a function of \nmean observed price by tract",
caption = "Fig. X.X",
x = "Mean price by tract",
y = "Mean absolute percent error by tract") +
plotTheme()
Using tidycensus, split your city into two groups (perhaps by race or income) and test your model’s generalizability. Is your model generalizable?
# 7. GENERALIZABILITY TEST
# divide Census tracts by income
incomeTest <- tracts %>%
mutate(incomeContext = if_else(medianIncome < 83019, "Below median income", "Above median income"))
# Plot the income groups division
# TODO: figure out WHY the title is centered on this one
ggplot() +
geom_sf(data = incomeTest,
aes(fill = incomeContext),
color = "gray20",
alpha = 0.6,
size = 0.2) +
scale_fill_manual("Income Context",
values = c("#25cb10", "#fa7800")) +
geom_sf(data = countyLimits,
fill = "transparent",
color = "gray40",
size = 0.5) +
labs(title = "Classification of Census tracts by income context",
subtitle = "Tracts above and below $83,019 median household income in 2019",
caption = "Fig. X.X") +
mapTheme() +
theme(legend.position="bottom",
plot.title = element_text(hjust = 0))
# test generalizability by income context
st_join(bothRegressions, incomeTest) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(., caption = "Test set MAPE by household income context") %>%
kable_classic()
Is this an effective model? What were some of the more interesting variables? How much of the variation in prices could you predict? Describe the more important features? Describe the error in your predictions? According to your maps, could you account the spatial variation in prices? Where did the model predict particularly well? Poorly? Why do you think this might be?
We consider this an effective model because, using only open, easily accessible data, it can predict home prices in Boulder County with a confidence of _____ .
What were some of the most interesting variables? [ADD A THIRD ONE?]<<<<
To assess whether the increasing effects of climate change have influenced house prices in Boulder, we developed two natural-disaster variables from data on Boulder County's open data website.
First, we looked into floodplain hazard data. In 2013, Boulder was hit by a flood event that destroyed 2,000 homes and left at least $4 billion in damage. This catastrophe led the state to create the Colorado Hazard Mapping Program (CHAMP) to produce more precise flood hazard prediction models, from which the data we incorporated is drawn.
Additionally, we sought to assess exposure to the increasingly frequent wildfires in Colorado using the historical wildfire perimeters in the county, a measure that can be tuned both by the distance of houses from the mapped fire extents and by how far back in time events are considered (10, 15, or 20 years). We did so by counting the number of wildfire events that have occurred within two miles of each home.
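As a minimal sketch, this two-mile count can be computed with `sf`; the object names `homes` (sale points) and `wildfires` (historical fire perimeters) are illustrative, and both layers are assumed to share a planar, foot-based CRS:

```r
library(sf)
library(dplyr)

# buffer each home by two miles (10,560 ft, assuming a foot-based CRS)
home.buffers <- st_buffer(homes, dist = 10560)

# count the historical fire perimeters intersecting each buffer
homes <- homes %>%
  mutate(wildfireHistory = lengths(st_intersects(home.buffers, wildfires)))
```

Filtering `wildfires` by ignition year before counting would implement the 10-, 15-, or 20-year look-back windows.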
[map that shows wildfire data] <
How much of the variation in prices could you predict? <
R squared, Abs error, mAPE.???
Describe the more important features? <
House quality, pct income group? HHvacancy-why?.
Describe the error in your predictions? <
Where is it, how is it distributed in price and space? <
[2x2 maps showing errors in the baseline model vs. errors with ad hoc ‘subcommunities’ as neighborhoods, vs. ‘census tracts’ as neighborhoods]<
Boulder's open data portal does not provide off-the-shelf county-wide neighborhood boundaries, so to examine the spatial variation in prices across the county, we created our own 'neighborhoods' by joining the City of Boulder's 'subcommunities' map to the city zoning map and then merging the result with the county-wide map of municipalities and unincorporated areas.
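The overlay described above can be sketched with `sf`; the layer and column names (`subcommunities`, `cityZoning`, `munis`, `name`) are illustrative assumptions, and all layers are assumed to share one CRS:

```r
library(sf)

# area already covered by the city subcommunities and zoning layers
city.cover <- st_union(rbind(subcommunities["name"], cityZoning["name"]))

# clip municipalities / unincorporated areas to everything outside that cover
county.rest <- st_difference(munis["name"], city.cover)

# stack the pieces into one ad hoc "neighborhoods" layer
neighborhoods <- rbind(subcommunities["name"], cityZoning["name"], county.rest)
```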
Because of the ad hoc nature of these boundaries, we tested how well they performed against a 'census tracts as neighborhoods' approach, acknowledging that doing so introduces some scale bias into the model. In the end, the census tracts proved more effective at the county level because they better differentiate between areas that our 'neighborhoods' lumped together as "agricultural" or "forest."
Since we chose census tracts as neighborhoods, we tried to reduce redundancies and spatial bias (the modifiable areal unit problem, or MAUP) by aggregating the census data at the more granular block-group level.
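As a hedged sketch, such a block-group pull can be done with `tidycensus` (already used elsewhere in this project); `B19013_001` is the real ACS median household income variable, while the object name and exact variable selection here are illustrative:

```r
library(tidycensus)

# median household income at the block-group level, 2015-2019 ACS
blockGroups <- get_acs(
  geography = "block group",
  variables = c(medHHInc = "B19013_001"),
  state = "CO", county = "Boulder",
  year = 2019, survey = "acs5", geometry = TRUE
)
```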
Where did the model predict particularly well? Poorly? Why do you think this might be?<
Would you recommend your model to Zillow? Why or why not? How might you improve this model?
We would confidently recommend our model to Zillow, but we would be sure to note its limitations. For the reasons mentioned above, a model trained on data from an area as homogeneous and arbitrarily defined as Boulder County inevitably carries a great deal of selection and scale bias, which may make it unreliable and poorly generalizable in more demographically diverse geographies.
Nonetheless, we believe our model takes a sound approach to selecting and transforming the internal and external characteristics and services that influence house values, and it would be straightforward to build on it to predict in a more diverse study area. For example, an optimized version of our model could instead be trained on aggregate data from the ten counties that form the Denver-Aurora-Lakewood Combined Statistical Area, of which Boulder County is a part.
Additionally, the model could benefit greatly from more powerful, data-specific prediction methods that we have not yet implemented.
The full results of our linear regression are below.
# output regression results table
# TODO: recode and/or label variables to make this not look terrible
stargazer(reg.training, type = "html")
| Dependent variable: | |
| logPrice | |
| year_quarter2019-2 | 0.039*** |
| (0.011) | |
| year_quarter2019-3 | 0.048*** |
| (0.012) | |
| year_quarter2019-4 | 0.042*** |
| (0.012) | |
| year_quarter2020-1 | 0.044*** |
| (0.012) | |
| year_quarter2020-2 | 0.075*** |
| (0.012) | |
| year_quarter2020-3 | 0.118*** |
| (0.011) | |
| year_quarter2020-4 | 0.151*** |
| (0.012) | |
| year_quarter2021-1 | 0.232*** |
| (0.012) | |
| year_quarter2021-2 | 0.310*** |
| (0.012) | |
| CompCode | 0.278*** |
| (0.030) | |
| EffectiveYear | 0.004*** |
| (0.0002) | |
| bsmtSF | -0.00001 |
| (0.00001) | |
| carStorageSF | 0.0001*** |
| (0.00002) | |
| nbrBedRoom | 0.008** |
| (0.004) | |
| nbrRoomsNobath | 0.008*** |
| (0.002) | |
| mainfloorSF | 0.0001*** |
| (0.00001) | |
| nbrThreeQtrBaths | 0.031*** |
| (0.005) | |
| nbrFullBaths | 0.023*** |
| (0.005) | |
| nbrHalfBaths | 0.027*** |
| (0.006) | |
| TotalFinishedSF | 0.0001*** |
| (0.00001) | |
| constMatConcrete | 0.059 |
| (0.121) | |
| constMatFrame | 0.041* |
| (0.025) | |
| constMatMasonry | 0.049** |
| (0.024) | |
| constMatMissing | -0.158*** |
| (0.049) | |
| constMatPrecast | -0.090 |
| (0.184) | |
| constMatUnspecified | -0.365*** |
| (0.079) | |
| constMatVeneer | -0.568** |
| (0.260) | |
| constMatWood | 0.099 |
| (0.139) | |
| hasBasement | 0.060*** |
| (0.009) | |
| hasGarage | 0.043*** |
| (0.011) | |
| hasAC | 0.022*** |
| (0.006) | |
| heatingTypeElectric Wall Heat (1500W) | 0.558*** |
| (0.195) | |
| heatingTypeForced Air | 0.038** |
| (0.017) | |
| heatingTypeGravity | 0.079 |
| (0.055) | |
| heatingTypeHeat Pump | -0.103 |
| (0.091) | |
| heatingTypeHot Water | 0.090*** |
| (0.019) | |
| heatingTypeNo HVAC | -0.084 |
| (0.181) | |
| heatingTypeNone | -0.252*** |
| (0.033) | |
| heatingTypePackage Unit | -0.330 |
| (0.296) | |
| heatingTypeRadiant Floor | 0.147*** |
| (0.032) | |
| heatingTypeUnspecified | 0.826*** |
| (0.122) | |
| heatingTypeVentilation Only | -0.774*** |
| (0.267) | |
| heatingTypeWall Furnace | -0.042 |
| (0.036) | |
| extWallBrick on Block | 0.029 |
| (0.041) | |
| extWallBrick Veneer | -0.008 |
| (0.035) | |
| extWallCedar | -0.030 |
| (0.101) | |
| extWallCement Board | -0.057 |
| (0.036) | |
| extWallFaux Stone | 0.035 |
| (0.050) | |
| extWallFrame Asbestos | -0.023 |
| (0.253) | |
| extWallFrame Stucco | 0.015 |
| (0.035) | |
| extWallFrame Wood/Shake | -0.001 |
| (0.035) | |
| extWallLog | 0.173*** |
| (0.046) | |
| extWallMetal | 0.005 |
| (0.095) | |
| extWallMissing | -0.261** |
| (0.108) | |
| extWallMoss Rock/Flagstone | 0.022 |
| (0.041) | |
| extWallPainted Block | 0.121** |
| (0.062) | |
| extWallStrawbale | 0.399** |
| (0.192) | |
| extWallVinyl | 0.025 |
| (0.047) | |
| extWall2Brick on Block | -0.406*** |
| (0.125) | |
| extWall2Brick Veneer | -0.399*** |
| (0.096) | |
| extWall2Cedar | -0.235 |
| (0.174) | |
| extWall2Cement Board | -0.388*** |
| (0.113) | |
| extWall2Faux Stone | -0.369*** |
| (0.097) | |
| extWall2Frame Asbestos | -0.378*** |
| (0.140) | |
| extWall2Frame Stucco | -0.383*** |
| (0.098) | |
| extWall2Frame Wood/Shake | -0.386*** |
| (0.096) | |
| extWall2Log | -0.248 |
| (0.158) | |
| extWall2Metal | -0.297** |
| (0.121) | |
| extWall2Moss Rock/Flagstone | -0.314*** |
| (0.099) | |
| extWall2None | -0.389*** |
| (0.096) | |
| extWall2Painted Block | 0.089 |
| (0.175) | |
| extWall2Vinyl | -0.403*** |
| (0.127) | |
| intWallHardwood Panel | 0.002 |
| (0.020) | |
| intWallMissing | -0.017 |
| (0.037) | |
| intWallPlaster | 0.049*** |
| (0.016) | |
| intWallPlywood | -0.044 |
| (0.045) | |
| intWallUnfinished | -0.056 |
| (0.042) | |
| intWallWall Board | -0.039 |
| (0.024) | |
| roofTypeConcrete Tile | 0.008 |
| (0.020) | |
| roofTypeMetal | 0.047** |
| (0.021) | |
| roofTypeMissing | 0.0004 |
| (0.006) | |
| roofTypeOther | -0.044* |
| (0.026) | |
| roofTypeRubber Membrane | 0.092*** |
| (0.022) | |
| qualityNum | 0.052*** |
| (0.002) | |
| builtEra1920s | 0.004 |
| (0.029) | |
| builtEra1930s | 0.014 |
| (0.032) | |
| builtEra1940s | -0.001 |
| (0.028) | |
| builtEra1950s | -0.068*** |
| (0.026) | |
| builtEra1960s | -0.052** |
| (0.025) | |
| builtEra1970s | -0.095*** |
| (0.025) | |
| builtEra1980s | -0.125*** |
| (0.025) | |
| builtEra1990s | -0.148*** |
| (0.026) | |
| builtEra2000s | -0.161*** |
| (0.026) | |
| builtEra2010s | -0.155*** |
| (0.028) | |
| builtEra2020s | -0.248*** |
| (0.031) | |
| builtEraPre-1910 | 0.027 |
| (0.028) | |
| manySections | 0.856*** |
| (0.032) | |
| designType2-3 Story | -0.013 |
| (0.010) | |
| designTypeBi-level | 0.026* |
| (0.016) | |
| designTypeMulti-Story Townhouse | -0.196*** |
| (0.013) | |
| designTypeSplit-level | 0.015 |
| (0.011) | |
| pctWhite | 0.212*** |
| (0.075) | |
| pctVacant | -0.279*** |
| (0.053) | |
| pctOwnerOccupied | -0.034 |
| (0.028) | |
| pctHigherEdu | 0.108*** |
| (0.039) | |
| pctIncomeBelow100k | 0.00001 |
| (0.045) | |
| pctIncomeAbove200k | 0.146*** |
| (0.049) | |
| wildfireHistory | 0.031*** |
| (0.004) | |
| floodRisk | -0.028** |
| (0.013) | |
| dispensaryDistance | 0.00000 |
| (0.00000) | |
| wholeFoodsDistance | -0.00002*** |
| (0.00000) | |
| highwayDistance | 0.00001** |
| (0.00000) | |
| tract08013012102 | -0.247*** |
| (0.027) | |
| tract08013012103 | -0.244*** |
| (0.034) | |
| tract08013012104 | -0.253*** |
| (0.034) | |
| tract08013012105 | -0.326*** |
| (0.033) | |
| tract08013012201 | -0.022 |
| (0.036) | |
| tract08013012202 | -0.179*** |
| (0.046) | |
| tract08013012203 | -0.448*** |
| (0.039) | |
| tract08013012204 | -0.078 |
| (0.065) | |
| tract08013012300 | -0.007 |
| (0.180) | |
| tract08013012401 | -0.226*** |
| (0.040) | |
| tract08013012501 | -0.395*** |
| (0.046) | |
| tract08013012505 | -0.117*** |
| (0.036) | |
| tract08013012507 | -0.367*** |
| (0.040) | |
| tract08013012508 | -0.339*** |
| (0.045) | |
| tract08013012509 | -0.267*** |
| (0.040) | |
| tract08013012510 | -0.111*** |
| (0.037) | |
| tract08013012511 | -0.453*** |
| (0.047) | |
| tract08013012603 | -0.354*** |
| (0.038) | |
| tract08013012605 | -0.281*** |
| (0.107) | |
| tract08013012607 | -0.308*** |
| (0.067) | |
| tract08013012608 | -0.385*** |
| (0.041) | |
| tract08013012701 | -0.448*** |
| (0.035) | |
| tract08013012705 | -0.551*** |
| (0.048) | |
| tract08013012707 | -0.274*** |
| (0.053) | |
| tract08013012708 | -0.556*** |
| (0.036) | |
| tract08013012709 | -0.520*** |
| (0.047) | |
| tract08013012710 | -0.310*** |
| (0.038) | |
| tract08013012800 | -0.720*** |
| (0.036) | |
| tract08013012903 | -0.624*** |
| (0.042) | |
| tract08013012904 | -0.604*** |
| (0.036) | |
| tract08013012905 | -0.534*** |
| (0.045) | |
| tract08013012907 | -0.595*** |
| (0.038) | |
| tract08013013003 | -0.480*** |
| (0.037) | |
| tract08013013004 | -0.484*** |
| (0.041) | |
| tract08013013005 | -0.403*** |
| (0.039) | |
| tract08013013006 | -0.432*** |
| (0.036) | |
| tract08013013201 | -0.547*** |
| (0.052) | |
| tract08013013202 | -0.241*** |
| (0.053) | |
| tract08013013205 | -0.548*** |
| (0.036) | |
| tract08013013207 | -0.817*** |
| (0.041) | |
| tract08013013208 | -0.847*** |
| (0.040) | |
| tract08013013210 | -0.803*** |
| (0.040) | |
| tract08013013211 | -0.793*** |
| (0.034) | |
| tract08013013212 | -0.723*** |
| (0.036) | |
| tract08013013213 | -0.788*** |
| (0.034) | |
| tract08013013302 | -0.711*** |
| (0.039) | |
| tract08013013305 | -0.786*** |
| (0.042) | |
| tract08013013306 | -0.783*** |
| (0.047) | |
| tract08013013307 | -0.822*** |
| (0.044) | |
| tract08013013308 | -0.816*** |
| (0.043) | |
| tract08013013401 | -0.817*** |
| (0.045) | |
| tract08013013402 | -0.870*** |
| (0.038) | |
| tract08013013503 | -0.837*** |
| (0.043) | |
| tract08013013505 | -0.811*** |
| (0.054) | |
| tract08013013506 | -0.869*** |
| (0.042) | |
| tract08013013507 | -0.833*** |
| (0.043) | |
| tract08013013508 | -0.801*** |
| (0.038) | |
| tract08013013601 | -0.463*** |
| (0.047) | |
| tract08013013602 | -0.330*** |
| (0.058) | |
| tract08013013701 | -0.464*** |
| (0.030) | |
| tract08013013702 | -0.422*** |
| (0.036) | |
| tract08013060600 | -0.621*** |
| (0.037) | |
| tract08013060700 | -0.590*** |
| (0.040) | |
| tract08013060800 | -0.599*** |
| (0.038) | |
| tract08013060900 | -0.618*** |
| (0.037) | |
| tract08013061300 | -0.740*** |
| (0.039) | |
| tract08013061400 | -0.721*** |
| (0.038) | |
| Constant | 4.747*** |
| (0.507) | |
| Observations | 11,261 |
| R2 | 0.789 |
| Adjusted R2 | 0.785 |
| Residual Std. Error | 0.248 (df = 11081) |
| F Statistic | 230.810*** (df = 179; 11081) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |